Introduction
The aim of this practical is to extend the hands-on to the comparison of brain and liver data of mouse embryo samples at 12.5 days of development.
First, run the container:
docker run -i -t -v /Users/andreab/Downloads/tutorial:/tutorial -w /tutorial ceciliaklein/teaching:uvic
After running this command, we can run the executable files in /tutorial/teaching-utils/ from anywhere in the terminal session without specifying the full path to the executable.
export PATH=$PATH:/tutorial/teaching-utils/;
For the following report we will create a new folder to store all outputs:
mkdir tutorial/final_task
mkdir tutorial/final_task/Q1 tutorial/final_task/Q2 tutorial/final_task/Q3 tutorial/final_task/Q4 tutorial/final_task/Q3 tutorial/final_task/Q5
Q1
Perform differential expression analysis between brain and liver using the EdgeR. Present results using a heatmap with hierarchical clustering in rows and columns and colored classification of differentially expressed genes (DEGs), i.e. overexpressed in brain versus liver and the other way around.
cd final_task/Q1
We choose coefficient 3 to compare brain and liver.
edgeR.analysis.R --input_matrix /tutorial/quantifications/encode.mouse.gene.expected_count.idr_NA.tsv \
--metadata /tutorial/data/gene.quantifications.index.tsv \
--fields tissue \
--coefficient 3 \
--output brain_X_liver
We write a list of the genes overexpressed in brain compared to liver, according to edgeR analysis:
awk '$NF<0.01 && $2<-10{print $1"\tover_brain_X_liver"}' edgeR.cpm1.n2.brain_X_liver.tsv > edgeR.0.02.over_brain_X_liver.txt #118
And genes overexpressed in liver compared to brain:
awk '$NF<0.01 && $2>10 {print $1"\tover_liver_X_brain"}' edgeR.cpm1.n2.brain_X_liver.tsv > edgeR.0.02.over_liver_X_brain.txt #107
Count how many overexpressed genes there are in each stage:
wc -l edgeR.0.02.over*.txt
118 edgeR.0.02.over_brain_X_liver.txt
107
edgeR.0.02.over_liver_X_brain.txt
225 total
Getting all DEGs of brain and liver with their gene type:
awk '$3=="gene"{ match($0, /gene_id "([^"]+).+gene_type "([^"]+)/, var); print var[1],var[2] }' OFS="\t" /tutorial/refs/gencode.vM4.gtf \
| join.py --file1 stdin \
--file2 <(cat edgeR.0.02.over*.txt) \
| sed '1igene\tedgeR\tgene_type' > gene.edgeR_BL.tsv
Perform a Heatmap:
# i copied the matrix from the TPM of the three tissues to a new file which i edited so i could choose only those fields for Brain and Liver
awk 'NR==1{print "-" "\t" $0; next} {print}' /tutorial/quantifications/encode.mouse.gene.TPM.idr_NA.tsv |cut -f1,2,3,6,7 > /tutorial/quantifications/encode.mouse.gene.TPM.idr_NA_BL.tsv # i shifted the first row to the right (i had to add '-' to the first field of the first row) and selected only those fields for Brain and Liver
awk 'NR == 1 {sub(/^-[\t]/, ""); print} NR > 1' /tutorial/quantifications/encode.mouse.gene.TPM.idr_NA_BL.tsv > /tutorial/quantifications/encode.mouse.gene.TPM.idr_NA_BL_fixed.tsv # i removed the first field '-' in the first row
# for the heatmap i chose the matrix file for Brain and Liver TPM
cut -f1 gene.edgeR_BL.tsv | tail -n+2 | selectMatrixRows.sh - /tutorial/quantifications/encode.mouse.gene.TPM.idr_NA_BL_fixed.tsv \
| ggheatmap.R --width 5 \
--height 8 \
--col_metadata /tutorial/data/temp_metadata.tsv \
--colSide_by tissue \
--col_labels labExpId \
--row_metadata gene.edgeR_BL.tsv \
--merge_row_mdata_on gene \
--rowSide_by edgeR,gene_type \
--row_labels none \
--log \
--pseudocount 0.1 \
--col_dendro \
--row_dendro \
--matrix_palette /tutorial/palettes/palDiverging.txt \
--colSide_palette /tutorial/palettes/palTissue2.txt \
--output heatmap.brain_X_liver.pdf
Q2
Perform gene ontology enrichment analysis of the two sets of DEGs using the command line wrapper of GOstats R package for biological processes. Plot results using any graphical representation and discuss results.
cd ../Q2
Prepare a file with the genes in the annotation:
awk '{split($10,a,/\"|\./); print a[2]}' /tutorial/refs/gencode.vM4.gtf | sort -u > universe.txt
Launch the GO enrichment script for the Biological Processes in the set of genes overexpressed in brain and liver:
The results can be visualized in a web browser, using the online Gene Ontology visualizer REVIGO.
for file in ../Q1/edgeR.0.02.over_liver_X_brain.txt ../Q1/edgeR.0.02.over_brain_X_liver.txt; do awk '{split($1,a,"."); print a[1]}' $file\
| GO_enrichment.R --universe universe.txt \
--genes stdin \
--categ BP \
--output $(basename ${file})_output \
--species mouse ;done
We copy the command output and paste it in the text box on the main page of the REVIGO web site.
awk 'NR==1{$1="% "$1}{print $1,$2}' edgeR.0.02.over_brain_X_liver.txt_output.BP.tsv
awk 'NR==1{$1="% "$1}{print $1,$2}' edgeR.0.02.over_liver_X_brain.txt_output.BP.tsv
Q3
Analyze differential splicing using SUPPA between brain and liver for skipping exon, intron retention, mutually exclusive exon and alternative first exon. Plot top results using heatmaps. Different thresholds may be chosen for each event type.
cd ../Q3
Types of events: Skipping Exon (SE), Retained intron (RI) , Mutually Exclusive Exons (MX), Alternative First (AF)
(FL : Alternative first (AF) and last (AL) exons (it generates both))
suppa.py generateEvents -i /tutorial/splicing/exon-annot.gtf -e SE RI MX FL -o localEvents -f ioe
Compute PSI for SE, RI, MX, AF:
for event in SE RI MX AF; do suppa.py psiPerEvent --total-filter 10 --ioe-file localEvents_${event}_strict.ioe --expression-file /tutorial/splicing/pc-tx.tsv -o PSI-${event}; done
Create individual files per tissue and event:
for event in SE RI MX AF; do for tissue in Brain Liver ;do echo -1 "Processing event $event and $tissue tissue"; selectMatrixColumns.sh PRNAembryo${tissue}1:PRNAembryo${tissue}2 PSI-${event}.psi > ${tissue}.${event}.psi;done;done
Compute differential splicing analysis for each event, comparing both tissues against all and performing multiple testing correction:
for event in SE RI MX AF; do
suppa.py diffSplice --method empirical \
--input localEvents_${event}_strict.ioe \
--psi Brain.${event}.psi Liver.${event}.psi \
--tpm /tutorial/splicing/expr.Brain.tsv /tutorial/splicing/expr.Liver.tsv \
-c -gc -o DS.${event} ; done
For SE and AF pval 0.05 and deltaPSI 0.5
for event in SE AF; do awk 'BEGIN{FS=OFS="\t"}NR==1{print }NR>1 && $2!="nan" && ($2>0.5 || $2<-0.5) && $3<0.05{print}' DS.${event}.dpsi |cut -f1 |grep -v Brain > top.${event}.Brain-Liver.txt ;done
For RI and MX pval 0.05 and deltaPSI 0.3
for event in RI MX ; do awk 'BEGIN{FS=OFS="\t"}NR==1{print }NR>1 && $2!="nan" && ($2>0.3 || $2<-0.3) && $3<0.05{print}' DS.${event}.dpsi |cut -f1 |grep -v Brain > top.${event}.Brain-Liver.txt ;done
Generate a heatmap for event:
for event in SE RI MX AF; do selectMatrixRows.sh top.${event}.Brain-Liver.txt DS.${event}.psivec > top.${event}.Brain-Liver.tsv; done
for event in SE RI MX AF; do ggheatmap.R -i top.${event}.Brain-Liver.tsv \
-o heatmap.${event}.psi.Brain-Liver.pdf \
--matrix_palette /tutorial/palettes/blue.9.txt \
--row_dendro \
--col_dendro \
--base_size 10; done
Heatmap Skipping Exon - Brain-Liver Heatmap Retained intron - Brain-Liver Heatmap Mutually Exclusive Exons - Brain-Liver Heatmap Alternative First - Brain-Liver
Q4
Q5
Q6
Show two examples of alternative first exons in the UCSC genome browser, including RNA-seq, ChIP-seq and ATAC-seq tracks. Discuss the integration of the three datasets in the TSS of the selected cases.
From the AF file generated in step 3:
less ../Q3/top.AF.Brain-Liver.txt
Example:
ENSMUSG00000002345.13;AF:chr8:70139707:70139772-70141864:70140277:70140356-70141864:+
Example:
ENSMUSG00000029922.11;AF:chr6:39410216-39419840:39420067:39410216-39420099:39420358:-